This notebook accompanies the Medium article on Bokeh and walks through the features mentioned there. Specifically, we will use data on NYC apartments to look at the relationship between price and square footage, while showing off some cool features of Bokeh. To get the data, just go to this GitHub repo.
from bokeh.io import curdoc, output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Range1d, HoverTool
from bokeh.embed import components
from bokeh.themes import Theme
import pandas as pd
import numpy as np
This little function makes our plots render cleanly inline in Jupyter. Adios, Matplotlib!
output_notebook()
This data comes from a little pipeline I built, which was outlined in this Medium article on AWS Lambda pipelines. Basically, our data covers New York City apartments and was scraped from Craigslist over June and July 2019. It includes a few enrichments that bring in data from MapQuest and Walk Score, but it should be pretty intuitive to understand.
Read in our data and convert the datetime column to a date.
df = pd.read_csv('data/nyc_apartments.csv')
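# Parse the timestamp column and keep only the calendar date
# (infer_datetime_format is deprecated in pandas 2.x, so it is omitted here)
df['date'] = pd.to_datetime(df['datetime']).dt.date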
df.head()
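As a small illustration of the date conversion, here is a sketch on two made-up timestamps. The strings and the variable names `raw`, `parsed`, and `dates` are assumptions for the example; the real `datetime` column may use a different format.

```python
import pandas as pd

# Hypothetical timestamps, standing in for the scraped 'datetime' column
raw = pd.Series(["2019-06-15 08:30:00", "2019-07-01 17:05:00"])

# to_datetime parses the strings into pandas timestamps
parsed = pd.to_datetime(raw)

# .dt.date drops the time of day and leaves plain Python date objects
dates = parsed.dt.date
print(dates)
```

Keeping only the date makes it easier to group listings by day later on, at the cost of losing the time of day.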
Bokeh has something called a "ColumnDataSource", which will quickly become your best friend. You can read about it in the docs, but at a high level it converts your Pandas dataframe into something Bokeh can easily use. You can see how we use this weapon of mass plotting in the charts below. The general process is to build a dataframe, wrap it in a ColumnDataSource, and then refer to its columns by name when plotting.
We can very easily create a ColumnDataSource from any dataframe.
source = ColumnDataSource(df)
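To get a feel for what a ColumnDataSource holds, here is a minimal sketch on a made-up three-row dataframe. The names `toy` and `toy_source` and the values are assumptions for illustration only.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Hypothetical miniature version of the apartments data
toy = pd.DataFrame({"area": [450, 700, 980], "price": [1900, 2600, 3400]})
toy_source = ColumnDataSource(toy)

# Each dataframe column becomes a named column in the source
# (the dataframe index is carried along as a column too)
print(toy_source.column_names)

# The underlying values are available through the .data mapping
print(list(toy_source.data["price"]))
```

These column names are what we pass as strings (for example `x='area'`) when plotting.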
# Create a dataframe where area is not null
df_has_area = df[df['area'].notnull()].copy()
# Keep only points below the 95th percentile of area
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]
# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)
# Create our figure
p = figure(title="Price vs. Square Footage")
# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)
show(p)
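The two filtering steps above (drop missing areas, then cut off the top 5%) could be wrapped in a small helper. This is only a sketch: `trim_area_outliers` is a hypothetical name, and the toy values are made up to include one obvious outlier.

```python
import numpy as np
import pandas as pd

def trim_area_outliers(frame, column="area", pct=95):
    """Drop rows with a missing `column`, then keep rows below the pct-th percentile."""
    present = frame[frame[column].notnull()]
    cutoff = np.percentile(present[column].values, pct)
    return present[present[column] < cutoff]

# Hypothetical listings: one missing area and one extreme value
toy = pd.DataFrame({
    "area": [400.0, 550.0, np.nan, 700.0, 12000.0],
    "price": [1800, 2300, 2100, 2900, 3000],
})
trimmed = trim_area_outliers(toy)
print(trimmed)
```

Trimming on a percentile rather than a fixed square footage means the cutoff adapts to whatever data was scraped.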
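The output isn't bad per se, but let's make it more visually appealing by stretching the plot to the full width, hiding the toolbar, removing the grid lines, and enlarging the title and axis label fonts.
All of this takes only a few small code changes.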
# Create a dataframe where area is not null
df_has_area = df[df['area'].notnull()].copy()
# Inspect the distinct bedroom counts (used for color mapping later)
df_has_area['bedrooms'].unique()
# Keep only points below the 95th percentile of area
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]
# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)
# Create our figure, now with the sizing mode feature
p = figure(title="Price vs. Square Footage", sizing_mode="stretch_width", tools=[], toolbar_location=None)
# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)
# Grid lines and font sizes
p.xgrid.grid_line_color, p.ygrid.grid_line_color = None, None
p.xaxis.major_label_text_font_size, p.yaxis.major_label_text_font_size = '11pt', '11pt'
p.title.text_font_size='14pt'
show(p)
Much better! But what if I always want my title to be size 14, or never want grid lines? Typing these settings in for every chart gets old quickly. Introducing themes!
curdoc().theme = Theme(json={'attrs': {
    # apply defaults to Figure properties
    # (note: on Bokeh 3.x the model is named 'figure', so that key may be needed instead)
    'Figure': {
        'toolbar_location': None,
        'outline_line_color': None,
        'min_border_right': 10,
        'sizing_mode': 'stretch_width'
    },
    # apply defaults to Grid properties
    'Grid': {
        'grid_line_color': None,
    },
    # apply defaults to Title properties
    'Title': {
        'text_font_size': '14pt'
    },
    # apply defaults to Axis properties
    'Axis': {
        'minor_tick_out': None,
        'minor_tick_in': None,
        'major_label_text_font_size': '11pt',
        'axis_label_text_font_size': '13pt',
        'axis_label_text_font': 'Work Sans'
    },
    # apply defaults to Legend properties
    'Legend': {
        'background_fill_alpha': 0.8,
    }
}})
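For comparison, here is a pared-down sketch of the same idea with only two settings. The name `small_theme` and the toy points are assumptions for illustration. The point is that a theme lives on the current document rather than on any single figure.

```python
from bokeh.io import curdoc
from bokeh.plotting import figure
from bokeh.themes import Theme

# Hypothetical minimal theme: only the title size and the grid color
small_theme = Theme(json={"attrs": {
    "Title": {"text_font_size": "14pt"},
    "Grid": {"grid_line_color": None},
}})

# The theme is attached to the current document, not to a figure
curdoc().theme = small_theme

# No styling arguments here; the theme should supply them at render time
p_themed = figure(title="Themed by default")
p_themed.scatter(x=[1, 2, 3], y=[3, 1, 2], size=10)
```

Because the theme is set once per document, every later figure in the notebook should pick up the same defaults without extra code.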
Now let's reuse the code from our original plot. This time we don't specify any of the styling attributes manually, because the theme applies them for us automatically.
# Create a dataframe where area is not null
df_has_area = df[df['area'].notnull()].copy()
# Inspect the distinct bedroom counts (used for color mapping later)
df_has_area['bedrooms'].unique()
# Keep only points below the 95th percentile of area
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]
# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)
# Create our figure
p = figure(title="Price vs. Square Footage")
# Plot our data
p.scatter(x='area', y='price', line_color='#000000', source=source, size=10)
show(p)
One nice feature of Bokeh is that you can use Pandas to create columns and then use them in your plot. In this example we will map each discrete value for bedrooms to a color and then use that column to color our plot.
# Create a dataframe where area is not null
df_has_area = df[df['area'].notnull()].copy()
# Inspect the distinct bedroom counts to decide which values need a color
df_has_area['bedrooms'].unique()
# Create color column based on the bedroom number; unmapped or missing values fall back to gray
bedroomMapping = {0: 'green', 1: 'red', 2: 'blue', 3: 'yellow', 4: 'purple', 5: 'black', 6: 'teal'}
df_has_area['color'] = df_has_area['bedrooms'].map(bedroomMapping).fillna('gray')
# Keep only points below the 95th percentile of area
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]
# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)
# Create our figure
p = figure(title="Price vs. Square Footage")
# Plot our data
p.scatter(x='area', y='price', fill_color='color', line_color='#000000', source=source, size=10)
show(p)
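The color-column trick is plain Pandas, so it can be tried without any plotting. Below is a sketch on a made-up bedrooms column. The names `bedrooms`, `palette`, and `colors` are assumptions, and the gray fallback is one possible way to handle values that have no entry in the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts, including a missing value and an unmapped one (7)
bedrooms = pd.Series([0, 1, 2, 7, np.nan])

# Only a few counts get their own color in this toy mapping
palette = {0: "green", 1: "red", 2: "blue"}

# map() leaves anything without a dictionary entry as NaN,
# so fillna() gives those rows a default color
colors = bedrooms.map(palette).fillna("gray")
print(colors.tolist())
```

Once a column like this sits in the dataframe, passing its name as `fill_color` lets Bokeh read a color per row from the ColumnDataSource.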
Let's add some of those cool tooltips like Tableau has!
# Create a dataframe where area is not null
df_has_area = df[df['area'].notnull()].copy()
# Inspect the distinct bedroom counts to decide which values need a color
df_has_area['bedrooms'].unique()
# Create color column based on the bedroom number; unmapped or missing values fall back to gray
bedroomMapping = {0: 'green', 1: 'red', 2: 'blue', 3: 'yellow', 4: 'purple', 5: 'black', 6: 'teal'}
df_has_area['color'] = df_has_area['bedrooms'].map(bedroomMapping).fillna('gray')
# Keep only points below the 95th percentile of area
df_has_area = df_has_area[df_has_area['area'] < np.percentile(df_has_area['area'].values, 95)]
# Define our ColumnDataSource
source = ColumnDataSource(df_has_area)
# Create our figure
p = figure(title="Price vs. Square Footage")
# Plot our data
p.scatter(x='area', y='price', fill_color='color', line_color='#000000', source=source, size=10)
# Create our tooltip
tooltips = """
<div style="width:500px;">
<h5 style="color:#0015bc; display:inline; font-size:1.2em">Craigslist URL: </h5>
<h5 style="color:#000000; font-size: 1.2em; display:inline;">@url</h5>
</div>
<div class="tooltip-section">
<h5 style="color:#0015bc; display:inline; font-size:1.2em">Price ($): </h5>
<h5 style="color:#000000; font-size: 1.2em; display:inline;">$@price{0,0}</h5>
</div>
<div class="tooltip-section">
<h5 style="color:#0015bc; display:inline; font-size:1.2em">Square Footage: </h5>
<h5 style="color:#000000; font-size: 1.2em; display:inline;">@area{0,0}</h5>
</div>
"""
p.add_tools(HoverTool(tooltips=tooltips))
show(p)
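If custom HTML is more than you need, HoverTool also accepts a list of (label, value) pairs. The sketch below uses a made-up three-point source. The names `toy_source`, `p_hover`, and `hover` are assumptions for illustration.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical miniature data source
toy_source = ColumnDataSource(data={"area": [450, 700, 980],
                                    "price": [1900, 2600, 3400]})

# Start with no default tools so the hover tool is the only one
p_hover = figure(title="Price vs. Square Footage", tools="")
p_hover.scatter(x="area", y="price", source=toy_source, size=10)

# "@column" pulls a value from the source; "{0,0}" adds thousands separators
hover = HoverTool(tooltips=[
    ("Price ($)", "@price{0,0}"),
    ("Square Footage", "@area{0,0}"),
])
p_hover.add_tools(hover)
```

The tuple form gives up control over styling, so the HTML version above is still the way to go when you want custom colors or layout.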
That is just the beginning of Bokeh. If you use Pandas a lot, I highly encourage you to keep learning Bokeh. It has served me well for creating visualizations, especially when sharing them with colleagues.